levelzero: only use Sysman queries instead of similar Core API queries #595

Merged
bgoglin merged 6 commits into open-mpi:master from l0-always-sysman
Dec 4, 2024

@bgoglin bgoglin commented Jun 12, 2023

This goes on top of #594, which might be merged once a compute-runtime release brings a working zesInit(). This PR requires zesInit() to be more widely available, especially the last commit, which completely removes the old setting of ZES_ENABLE_SYSMAN=1 in the environment.
Now that we require zesInit(), just use ZES queries instead of switching between ZES and the Core API depending on whether ZES is available.

@bgoglin bgoglin force-pushed the l0-always-sysman branch from f04d462 to ecf31fb Compare June 12, 2023 09:55
@bgoglin bgoglin force-pushed the l0-always-sysman branch 2 times, most recently from d978732 to 980e829 Compare June 20, 2023 09:32
@bgoglin

bgoglin commented Nov 8, 2024

This should be rebased/simplified on top of #695

@bgoglin

bgoglin commented Nov 13, 2024

@saik-intel Now that we use zesInit(), I guess there's no reason to query both zeDevicePciGetPropertiesExt() and zesDevicePciGetProperties() in case one fails but not the other. The latter should always be supported now, right?
Same question for zeDeviceGetMemoryProperties() vs zesDeviceEnumMemoryModules()+zesMemoryGetProperties(): is there any reason to prefer one over the other? We just want to know how much memory each device or subdevice has, and whether that memory is HBM, DRAM, etc.

@bgoglin bgoglin changed the title from "[WIP DNM] L0 always sysman" to "levelzero: only use Sysman queries instead of similar Core API queries" Nov 27, 2024
@bgoglin

bgoglin commented Nov 27, 2024

@TApplencourt Could you please run lstopo aurora.xml from this branch using the tarball from https://ci.inria.fr/hwloc/job/basic/view/change-requests/job/PR-595/ ? I'd like to double-check that we don't lose any info when removing non-Sysman queries.

@TApplencourt

Of course! Here it is:
aurora.xml.txt

At first glance, everything looks good to me.

applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> ./ici/bin/lstopo  | grep "ze"
              CoProc(LevelZero) "ze0"
                CoProc(LevelZero) "ze0.0"
                CoProc(LevelZero) "ze0.1"
              CoProc(LevelZero) "ze1"
                CoProc(LevelZero) "ze1.0"
                CoProc(LevelZero) "ze1.1"
              CoProc(LevelZero) "ze2"
                CoProc(LevelZero) "ze2.0"
                CoProc(LevelZero) "ze2.1"
              CoProc(LevelZero) "ze3"
                CoProc(LevelZero) "ze3.0"
                CoProc(LevelZero) "ze3.1"
              CoProc(LevelZero) "ze4"
                CoProc(LevelZero) "ze4.0"
                CoProc(LevelZero) "ze4.1"
              CoProc(LevelZero) "ze5"
                CoProc(LevelZero) "ze5.0"
                CoProc(LevelZero) "ze5.1"
applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> ./ici/bin/lstopo  | grep "cl"
              CoProc(OpenCL) "opencl0d0"
              CoProc(OpenCL) "opencl0d1"
              CoProc(OpenCL) "opencl0d2"
              CoProc(OpenCL) "opencl0d3"
              CoProc(OpenCL) "opencl0d4"
              CoProc(OpenCL) "opencl0d5"

But make check reported an error:

applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> cat tests/hwloc/test-suite.log
========================================================================
   hwloc PR-595-20241113.1537.gitc2b2cfd7: tests/hwloc/test-suite.log
========================================================================

# TOTAL: 46
# PASS:  45
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: levelzero
===============

levelzero: ../../../tests/hwloc/levelzero.c:188: int main(void): Assertion `atoi(osdev->name+2) == (int) k' failed.
./wrapper.sh: line 32: 175099 Aborted                 "$@"
FAIL levelzero (exit status: 134)

Switching between COMPOSITE and FLAT doesn't change anything:

applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> ZE_FLAT_DEVICE_HIERARCHY=FLAT ./tests/hwloc/levelzero
testing ZE devices
found 1 L0 drivers
found 12 L0 devices in driver #0
found OSDev ze0
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #0 device #0
found OSDev ze0
levelzero: ../../../tests/hwloc/levelzero.c:100: int main(void): Assertion `atoi(osdev->name+2) == (int) k' failed.
Aborted
applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE ./tests/hwloc/levelzero
testing ZE devices
found 1 L0 drivers
found 6 L0 devices in driver #0
found OSDev ze0
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #0 device #0
found OSDev ze1
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #1 device #0
found OSDev ze2
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #2 device #0
found OSDev ze3
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #3 device #0
found OSDev ze4
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #4 device #0
found OSDev ze5
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #5 device #0
testing ZES devices
found 1 L0 ZES drivers
found 6 L0 ZES devices in driver #0
found OSDev ze1
levelzero: ../../../tests/hwloc/levelzero.c:188: int main(void): Assertion `atoi(osdev->name+2) == (int) k' failed.
Aborted

Using:

applenco@x1921c3s2b0n0:~/hwloc-PR-595-20241113.1537.gitc2b2cfd7/build> ze_info
Number of drivers                                 1
  Driver API Version                              1.5
  Driver Version                                  17004696
$intel_compute_runtime/release/996.26

@bgoglin

bgoglin commented Nov 28, 2024

Thanks. There's at least one bug in the test file; I'll dig further into your report.

Aside from that, it looks like ZES fails to report a valid PCI link speed (it shows 0.25GB/s instead of 63 in your case, and nothing on my machines). I'll revert that part and use ZE for this for now.

@bgoglin

bgoglin commented Nov 28, 2024

Could you comment out the assert on line 188 of tests/hwloc/levelzero.c, make -C tests/hwloc levelzero && tests/hwloc/levelzero? I'd like to confirm that devices are reported by ZES and ZE in different orders. That'd be funny, but easy to fix.

@TApplencourt

TApplencourt commented Dec 3, 2024

With that:

levelzero: ../../../tests/hwloc/levelzero.c:199: int main(void): Assertion `atoi(value) == (int) j' failed.

Commenting out this line as well makes the test pass:

testing ZE devices
found 1 L0 drivers
found 6 L0 devices in driver #0
found OSDev ze0
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #0 device #0
found OSDev ze1
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #1 device #0
found OSDev ze2
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #2 device #0
found OSDev ze3
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #3 device #0
found OSDev ze4
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #4 device #0
found OSDev ze5
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #5 device #0
testing ZES devices
found 1 L0 ZES drivers
found 6 L0 ZES devices in driver #0
found OSDev ze5
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #0 device #0
found OSDev ze4
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #1 device #0
found OSDev ze0
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #2 device #0
found OSDev ze1
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #3 device #0
found OSDev ze3
got cpuset 0x0000ffff,0xffffffff,0xf0000000,0x000000ff,0xffffffff,0xfff00000,0x0 for driver #4 device #0
found OSDev ze2
got cpuset 0x0fffffff,0xffffff00,,0x000fffff,0xffffffff for driver #5 device #0

Regarding the order, on Aurora we set ZE_ENABLE_PCI_ID_DEVICE_ORDER.

To quote the doc at https://spec.oneapi.io/level-zero/1.0.4/core/api.html: "The number and order of handles returned from this function is affected by the ::ZE_AFFINITY_MASK and ::ZE_ENABLE_PCI_ID_DEVICE_ORDER environment variables."

But that didn't make the "original" tests pass (i.e. without commenting out the assert, the test still fails even when ZE_ENABLE_PCI_ID_DEVICE_ORDER is unset), so I guess it's a red herring.

And yes, I see nothing in the spec saying that ZE and ZES devices will be returned in the same order (and ZE devices can be masked via ZE_AFFINITY_MASK, so 🤷🏽).

I did a quick experiment, and indeed they are not.
In the first program, I enumerate the ze_device handles, map each ze_device to its zes_device (using https://github.com/intel/compute-runtime/blob/master/programmers-guide/SYSMAN.md#mapping-core-device-handle-to-sysman-device-handle-with-zesinit-initialization), and print the serial number.

In the second, I enumerate the zes devices directly and print the serial number. The order does indeed change.

applenco@x1922c1s6b0n0:~/brice> icpx -g -lze_loader tiny_zeinfo.cpp && ./a.out
Platforn #0: driver_version 17004696
ze Device idx:0 | UUID: 00000000-0000-0000-D1DC-20956DF354ED
ze Device idx:0 | SERIAL_NUMBER: 0x28196d27bfe70246
ze Device idx:1 | UUID: 00000000-0000-0000-EBD3-7F0644A285A1
ze Device idx:1 | SERIAL_NUMBER: 0x2819712755dde1b2
ze Device idx:2 | UUID: 00000000-0000-0000-2CF9-20C1117CDCCF
ze Device idx:2 | SERIAL_NUMBER: 0x28386f27ab8088ce
ze Device idx:3 | UUID: 00000000-0000-0000-DA5F-7E0E09CCFD0F
ze Device idx:3 | SERIAL_NUMBER: 0x507e972745369ca8
ze Device idx:4 | UUID: 00000000-0000-0000-952D-1C2DEB7AC4BF
ze Device idx:4 | SERIAL_NUMBER: 0xac2e9927f7f5c539
ze Device idx:5 | UUID: 00000000-0000-0000-6F57-3A5A50EA4281
ze Device idx:5 | SERIAL_NUMBER: 0xac2c78279b606c85
applenco@x1922c1s6b0n0:~/brice> icpx -g -lze_loader tiny_zesinfo.cpp && ./a.out
zes Device idx:0 | SERIAL_NUMBER: 0x28196d27bfe70246
zes Device idx:1 | SERIAL_NUMBER: 0x2819712755dde1b2
zes Device idx:2 | SERIAL_NUMBER: 0xac2c78279b606c85
zes Device idx:3 | SERIAL_NUMBER: 0xac2e9927f7f5c539
zes Device idx:4 | SERIAL_NUMBER: 0x507e972745369ca8
zes Device idx:5 | SERIAL_NUMBER: 0x28386f27ab8088ce

Sorry for the poor code quality below, in case you want to try to reproduce (but I guess you don't have access to a PVC :( ):

applenco@x1922c1s6b0n0:~/brice> cat tiny_zeinfo.cpp
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <iostream>
#include <iomanip>
#include <sstream>
#include <string>

#include <level_zero/ze_api.h>
#include <level_zero/zes_api.h>
#include <cassert>

std::string uuid_to_string(const uint8_t uuid[]) {
  std::stringstream ss;
  ss << std::hex << std::setfill('0') << std::uppercase;
  ss << std::setw(2) << +uuid[15];
  ss << std::setw(2) << +uuid[14];
  ss << std::setw(2) << +uuid[13];
  ss << std::setw(2) << +uuid[12];
  ss << '-';
  ss << std::setw(2) << +uuid[11];
  ss << std::setw(2) << +uuid[10];
  ss << '-';
  ss << std::setw(2) << +uuid[9];
  ss << std::setw(2) << +uuid[8];
  ss << '-';
  ss << std::setw(2) << +uuid[7];
  ss << std::setw(2) << +uuid[6];
  ss << '-';
  ss << std::setw(2) << +uuid[5];
  ss << std::setw(2) << +uuid[4];
  ss << std::setw(2) << +uuid[3];
  ss << std::setw(2) << +uuid[2];
  ss << std::setw(2) << +uuid[1];
  ss << std::setw(2) << +uuid[0];
  return ss.str();
}

zes_device_handle_t getSysmanDeviceHandleFromCoreDeviceHandle(ze_device_handle_t hDevice)
{
    ze_device_properties_t deviceProperties = { ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES };
    ze_result_t result = zeDeviceGetProperties(hDevice, &deviceProperties);
    if (result != ZE_RESULT_SUCCESS) {
        printf("Error: zeDeviceGetProperties failed, result = %d\n", result);
        return nullptr;
    }

    zes_uuid_t uuid = {};
    memcpy(uuid.id, deviceProperties.uuid.id, ZE_MAX_DEVICE_UUID_SIZE);

    uint32_t driverCount = 0;
    result = zesDriverGet(&driverCount, nullptr);
    if (driverCount == 0) {
        printf("Error could not retrieve driver\n");
        exit(-1);
    }
    zes_driver_handle_t* allDrivers = (zes_driver_handle_t*)malloc(driverCount * sizeof(zes_driver_handle_t));
    result = zesDriverGet(&driverCount, allDrivers);
    if (result != ZE_RESULT_SUCCESS) {
        free(allDrivers);
        printf("Error:  zesDriverGet failed, result = %d\n", result);
        return nullptr;
    }

    zes_device_handle_t phSysmanDevice = nullptr;
    ze_bool_t onSubdevice = false;
    uint32_t subdeviceId = 0;
    for (uint32_t it = 0; it < driverCount; it++) {
        result = zesDriverGetDeviceByUuidExp(allDrivers[it], uuid, &phSysmanDevice, &onSubdevice, &subdeviceId);
        if (result == ZE_RESULT_SUCCESS && (phSysmanDevice != nullptr)) {
            break;
        }
    }
    free(allDrivers);

    return phSysmanDevice;
}

#define ZE_CHECK(result, message) \
  do { \
    assert(ZE_RESULT_SUCCESS == (result) && message); \
  } while (0)

int main()
{
  ze_result_t err;
  //  _              _                      _
  // |_) |  _. _|_ _|_ _  ._ ._ _    ()    | \  _     o  _  _
  // |   | (_|  |_  | (_) |  | | |   (_X   |_/ (/_ \/ | (_ (/_
  //
  // Initialize the driver
  err  = zeInit(ZE_INIT_FLAG_GPU_ONLY);
  ZE_CHECK(err, "zeInit");

  err  = zesInit(ZE_INIT_FLAG_GPU_ONLY);
  ZE_CHECK(err, "zesInit");

  // Discover all the driver instances
  uint32_t driverCount = 0;
  err = zeDriverGet(&driverCount, NULL);
  ZE_CHECK(err, "zeDriverGet");

  // Now retrieve the driver handles into phDrivers
  ze_driver_handle_t* phDrivers = (ze_driver_handle_t*) malloc(driverCount * sizeof(ze_driver_handle_t));
  err = zeDriverGet(&driverCount, phDrivers);
  ZE_CHECK(err, "zeDriverGet");

  for(uint32_t driver_idx = 0; driver_idx < driverCount; driver_idx++) {

    ze_driver_handle_t driver = phDrivers[driver_idx];

    ze_driver_properties_t driver_properties = { ZE_STRUCTURE_TYPE_DRIVER_PROPERTIES };
    err = zeDriverGetProperties(driver, &driver_properties);
    ZE_CHECK(err, "zeDriverGetProperties");

    printf("Platforn #%d: driver_version %u\n", driver_idx, driver_properties.driverVersion);

    /* - - - -
    Device
    - - - - */

    uint32_t deviceCount = 0;
    err = zeDeviceGet(driver, &deviceCount, NULL);
    ZE_CHECK(err, "zeDeviceGet");

    ze_device_handle_t* phDevices = (ze_device_handle_t*) malloc(deviceCount * sizeof(ze_device_handle_t));
    err = zeDeviceGet(driver, &deviceCount, phDevices);
    ZE_CHECK(err, "zeDeviceGet");

    for(uint32_t device_idx = 0;  device_idx < deviceCount; device_idx++) {
      {
        // Core API properties: stype must be set before the query.
        ze_device_properties_t device_properties = { ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES };
        err = zeDeviceGetProperties(phDevices[device_idx], &device_properties);
        ZE_CHECK(err, "zeDeviceGetProperties");
        std::cout << "Device idx:" << device_idx << " | UUID: " << uuid_to_string(device_properties.uuid.id) << std::endl;
      }

      {
        // Sysman properties of the matching zes_device, looked up by UUID.
        zes_device_properties_t device_properties = { ZES_STRUCTURE_TYPE_DEVICE_PROPERTIES };
        err = zesDeviceGetProperties(getSysmanDeviceHandleFromCoreDeviceHandle(phDevices[device_idx]), &device_properties);
        ZE_CHECK(err, "zesDeviceGetProperties");
        std::cout << "Device idx:" << device_idx << " | SERIAL_NUMBER: " << std::string(device_properties.serialNumber) << std::endl;
      }
    }
    free(phDevices);
  }

  free(phDrivers);
  return 0;
}
applenco@x1922c1s6b0n0:~/brice> cat tiny_zesinfo.cpp
#include <stdlib.h>

#include <iostream>
#include <iomanip>
#include <sstream>
#include <string>

#include <level_zero/ze_api.h>
#include <level_zero/zes_api.h>
#include <cassert>

std::string uuid_to_string(const uint8_t uuid[]) {
  std::stringstream ss;
  ss << std::hex << std::setfill('0') << std::uppercase;
  ss << std::setw(2) << +uuid[15];
  ss << std::setw(2) << +uuid[14];
  ss << std::setw(2) << +uuid[13];
  ss << std::setw(2) << +uuid[12];
  ss << '-';
  ss << std::setw(2) << +uuid[11];
  ss << std::setw(2) << +uuid[10];
  ss << '-';
  ss << std::setw(2) << +uuid[9];
  ss << std::setw(2) << +uuid[8];
  ss << '-';
  ss << std::setw(2) << +uuid[7];
  ss << std::setw(2) << +uuid[6];
  ss << '-';
  ss << std::setw(2) << +uuid[5];
  ss << std::setw(2) << +uuid[4];
  ss << std::setw(2) << +uuid[3];
  ss << std::setw(2) << +uuid[2];
  ss << std::setw(2) << +uuid[1];
  ss << std::setw(2) << +uuid[0];
  return ss.str();
}


#define ZE_CHECK(result, message) \
  do { \
    assert(ZE_RESULT_SUCCESS == (result) && message); \
  } while (0)

int main()
{
  ze_result_t err;
  //  _              _                      _
  // |_) |  _. _|_ _|_ _  ._ ._ _    ()    | \  _     o  _  _
  // |   | (_|  |_  | (_) |  | | |   (_X   |_/ (/_ \/ | (_ (/_
  //
  // Initialize the driver
  err  = zesInit(ZE_INIT_FLAG_GPU_ONLY);
  ZE_CHECK(err, "zesInit");

  // Discover all the driver instances
  uint32_t driverCount = 0;
  err = zesDriverGet(&driverCount, NULL);
  ZE_CHECK(err, "zesDriverGet");
  std::cout << driverCount  << std::endl;
  // Now retrieve the driver handles into phDrivers
  zes_driver_handle_t* phDrivers = (zes_driver_handle_t*) malloc(driverCount * sizeof(zes_driver_handle_t));
  err = zesDriverGet(&driverCount, phDrivers);
  ZE_CHECK(err, "zesDriverGet");
  std::cout << driverCount << std::endl;
  for(uint32_t driver_idx = 0; driver_idx < driverCount; driver_idx++) {

    zes_driver_handle_t driver = phDrivers[driver_idx];
    /* - - - -
    Device
    - - - - */

    // if count is zero, then the driver will update the value with the total number of devices available.
    uint32_t deviceCount = 0;
    err = zesDeviceGet(driver, &deviceCount, NULL);
    ZE_CHECK(err, "zesDeviceGet");

    zes_device_handle_t* phDevices = (zes_device_handle_t*) malloc(deviceCount * sizeof(zes_device_handle_t));
    err = zesDeviceGet(driver, &deviceCount, phDevices);
    ZE_CHECK(err, "zesDeviceGet");

    for(uint32_t device_idx = 0;  device_idx < deviceCount; device_idx++) {
        // Sysman properties: stype must be set before the query.
        zes_device_properties_t device_properties = { ZES_STRUCTURE_TYPE_DEVICE_PROPERTIES };
        err = zesDeviceGetProperties(phDevices[device_idx], &device_properties);
        ZE_CHECK(err, "zesDeviceGetProperties");
        std::cout << "Device idx:" << device_idx << " | SERIAL_NUMBER: " << std::string(device_properties.serialNumber) << std::endl;
    }
    free(phDevices);
  }

  free(phDrivers);
  return 0;
}

Didn't seem to break anything so far.

Signed-off-by: Brice Goglin <[email protected]>
ZE and ZES may return devices in different orders.

open-mpi#595 (comment)

Signed-off-by: Brice Goglin <[email protected]>
Now that zesInit() is mandatory, don't bother falling back
to the core API, Sysman shouldn't fail.

Signed-off-by: Brice Goglin <[email protected]>
We don't need it anymore.

Signed-off-by: Brice Goglin <[email protected]>
@bgoglin bgoglin merged commit 3a7153f into open-mpi:master Dec 4, 2024
1 check passed
@bgoglin bgoglin deleted the l0-always-sysman branch December 4, 2024 10:36
@bgoglin

bgoglin commented Dec 4, 2024

Thanks a lot @TApplencourt!

bgoglin added a commit that referenced this pull request Dec 4, 2024
ZE and ZES may return devices in different orders.

#595 (comment)

Signed-off-by: Brice Goglin <[email protected]>
(cherry picked from commit efcd681)